Abstract - This paper proposes a novel latent semantic learning
method for extracting high-level features (i.e. latent semantics)
from a large vocabulary of abundant mid-level features (i.e. visual
keywords) with structured sparse representation, which can help
to bridge the semantic gap in the challenging task of human
action recognition. To discover the manifold structure of mid-level
features, we develop a spectral embedding approach to
latent semantic learning based on the L1-graph, without the need
to tune any parameter for graph construction, which is a key step
of manifold learning. More importantly, we construct the L1-graph
with structured sparse representation, which can be obtained by
structured sparse coding with its structured sparsity ensured
by a novel L1-norm hypergraph regularization over mid-level
features. In the new embedding space, we learn latent semantics
automatically from abundant mid-level features through spectral
clustering. The learnt latent semantics can be readily used for
human action recognition with SVM by defining a histogram
intersection kernel. Different from the traditional latent semantic
analysis based on topic models, our latent semantic learning
method can explore the manifold structure of mid-level features
in both L1-graph construction and spectral embedding, which
results in compact but discriminative high-level features. The
experimental results on the commonly used KTH action dataset
and the unconstrained YouTube action dataset show the superior
performance of our method.

Index Terms - Human action recognition, latent semantic
learning, spectral embedding, structured sparse representation,
L1-norm hypergraph regularization.



I. Introduction 

Automatic recognition of human actions in videos has a
wide range of applications such as video summarization,
human-computer interaction, and activity surveillance. Although
many impressive results have been reported on human
action recognition, it still remains a challenging problem
[1] owing to viewpoint changes, occlusions, and background
clutter. In the literature, one direct strategy is to measure
how humans are moving in the scene, using techniques
for tracking or body pose estimation [2]-[4]. However, a
distinct limitation of this strategy is that it requires reliable
tracking or body pose estimation, which is difficult for realistic
videos. Another, more effective strategy adopts an intermediate
representation based on spatio-temporal interest points [5]-[7]
to bridge the semantic gap between low-level spatio-temporal
features and high-level action categories. In particular, recent
work has shown promising results when local spatio-temporal
descriptors are used for bag-of-words (BOW) models
[8]-[11], where the local features are quantized to form a
visual vocabulary and each video clip is thus summarized as
a histogram of visual keywords. In the following, we refer to
the visual keywords as mid-level features to distinguish them
from the low-level features and high-level action categories.

However, this BOW representation may suffer from the
redundancy of mid-level features, since typically thousands
of visual keywords are formed to obtain better performance
on a relatively large action dataset [12]. Here, it should be
noted that a large vocabulary size means that the BOW
representation incurs a large time cost not only in vocabulary
formation but also in later action recognition. Moreover,
the mid-level features are applied to human action recognition
independently, and mainly first-order statistics are considered.
Intuitively, the higher-order semantic correlation between
mid-level features is very useful for bridging the semantic 
gap in human action recognition. Although the semantic 
information can be incorporated into the visual vocabulary 
using either local descriptor annotation or video annotation, 
the manual labeling is too expensive and tedious for a large 
action dataset. Therefore, to reduce the redundancy of mid- 
level features, this paper focuses on automatically extracting 
high-level features (or latent semantics) that are compact in 
size but more discriminative in terms of descriptive power. 

Previously, several unsupervised methods [13], [14] have
been developed to learn latent semantics based on topic
models, such as probabilistic latent semantic analysis (PLSA)
[15] and latent Dirichlet allocation (LDA) [16]. A mixture
of latent topics is used to model each video, and the topics
are learnt as multinomial distributions of mid-level features.
Moreover, information theory has also been applied to latent
semantic analysis for human action recognition in [17], [18].
The success of these models may be due to the fact that
semantically similar mid-level features generally have a higher
probability of co-occurring in a video across the entire dataset.
It should be noted that, besides this simple co-occurrence
information, there also exists more complicated semantic
similarity information, e.g., the mid-level features generated from
similar video contents tend to lie on the same geometric
or manifold structure. However, this intrinsic information is
not considered by the latent topic or information theoretic
models [13], [14], [17], [18]. In the literature, very few
attempts have been made to explicitly preserve the manifold
geometry of the mid-level feature space when learning high-level
latent semantics from the abundant mid-level features.
To the best of our knowledge, [19] is the first attempt to extract
latent semantics from videos for human action recognition
using a manifold learning technique based on diffusion maps
[20]. Although this diffusion map method has been shown to
achieve better results than the information theoretic models in
[19], it requires fine parameter tuning for graph construction,
which can significantly affect the performance and has been
noted as an inherent weakness of graph-based methods.

Fig. 1. The flowchart of human action recognition using the latent
semantics (i.e. high-level features) learnt by spectral embedding based
on the L1-graph constructed with structured sparse representation.

To address the above problems associated with human
action recognition, we propose a novel latent semantic learning
method based on spectral embedding [21]-[23] with the L1-graph,
without the need to tune any parameter for graph construction,
which is a key step of manifold learning. More importantly, we
construct the L1-graph with structured sparse representation,
which can be obtained by structured sparse coding with
its structured sparsity ensured by a novel L1-norm hypergraph
regularization over mid-level features. A distinct
advantage of characterizing the similarity between mid-level
features based on structured sparse representation is that we
can obtain a sparse affinity matrix in a parameter-free manner.
In contrast, since the mid-level features are represented as
vectors of point-wise mutual information and their similarity
is typically characterized via a Gaussian function in [19], the
choice of the variance in this function has been shown to affect
the performance of human action recognition significantly. To
summarize, through spectral embedding based on the L1-graph,
we can discover more of the intrinsic manifold structure hidden
among mid-level features and thus learn more compact but
discriminative latent semantics by spectral clustering in the
new embedding space, as shown in our later
experiments. In this paper, we focus on parameter-free L1-graph
construction and only consider the commonly used
spectral embedding method [21], regardless of the many other
manifold learning techniques in the literature.

Since our new L1-norm hypergraph regularization
ensures the structured sparsity in L1-graph construction, we
discuss it in detail as follows. Although derived from the
traditional Laplacian regularization [24]-[26], our L1-norm
hypergraph regularization is more suitable for parameter-free
L1-graph construction as an L1-norm term (see Section III-B).
More importantly, we can exploit the manifold structure of
mid-level features for graph construction and simultaneously
introduce another important type of sparsity by L1-norm
hypergraph regularization, which is the main difference between
our structured sparse coding and the traditional sparse coding
[27]-[29]. In this paper, the hypergraph [30]-[32] used for
our L1-norm hypergraph regularization is also constructed in
a parameter-free manner. That is, each video that contains
multiple mid-level features is regarded as a hyperedge, and
its weight can be estimated based on the original cluster
centers associated with mid-level features. Although both
sparse representation and hypergraph are also used in our short
conference version [29], the present paper integrates them
in a unified structured sparse representation framework. Since
our L1-norm hypergraph regularization can be applied to many
other machine learning problems (considering the wide use
of Laplacian regularization), the present paper makes a
significant extra contribution as compared to [29]. In addition,
the proposed structured sparse coding can be
considered as another extra contribution.

In this paper, we apply the learnt latent semantics to
human action recognition with SVM by defining a histogram
intersection kernel. The flowchart of human action recognition
is illustrated in Fig. 1, which contains four components:
extraction of low-level spatio-temporal descriptors in the same
way as [6], formation of mid-level features by k-means clustering,
extraction of high-level latent semantics by spectral embedding
based on the L1-graph, and action classification with SVM. We
have tested our method on the commonly used KTH action
dataset [5] and the unconstrained YouTube action dataset [18]. The
experimental results demonstrate the superior performance of
our method for human action recognition. To emphasize the
main contributions of this paper, we summarize the following
advantages of our method:

(1) Our method can learn compact but discriminative
latent semantics by exploring the manifold structure
of visual keywords in both L1-graph construction and
spectral embedding, which is quite different from the
traditional latent semantic analysis based on topic
models.

(2) This is the first attempt to develop novel structured
sparse coding for latent semantic learning in the challenging
task of human action recognition, although
many efforts have already been made to apply sparse
coding to other difficult tasks in the literature.

(3) Our new L1-norm hypergraph regularization can incorporate
the manifold structure of mid-level features
into graph construction. More importantly, it can
be further applied to many other machine learning
problems, considering the wide use of Laplacian
regularization.

(4) Our method has been shown to significantly outperform
other latent semantic learning approaches [13],
[17]-[19], which is all the more impressive given
that we do not use feature pruning [7], [18], [19],
multiple types of low-level features [7], [17], [18],
or spatio-temporal layout information [11], [17], [33]
for human action recognition.

The remainder of this paper is organized as follows. Section II
gives a brief review of related work. Section III
proposes a latent semantic learning method based on structured
sparse representation. In Section IV, we present the details of
human action recognition with SVM using our learnt latent
semantics. In Section V, our method is evaluated on the commonly
used KTH action dataset and the unconstrained YouTube
action dataset. Finally, Section VI gives the conclusions.

II. Related Work 

Our method differs from other latent semantic learning
approaches based on latent topic [13], [14] or information
theoretic models [17], [18] in that the manifold structure of
mid-level features can be explored in both L1-graph construction
and spectral embedding, which results in compact but
discriminative high-level features. Although the diffusion map
method [19] can also exploit this manifold structural information
for latent semantic learning, it requires fine parameter
tuning for graph construction, which can significantly affect
the performance. In contrast, our method can construct the L1-graph
in a parameter-free manner by structured sparse coding
with L1-norm hypergraph regularization. More importantly, as
shown in later experiments, our spectral embedding with the L1-graph
can help to discover more of the intrinsic manifold structure of
mid-level features and thus learn more compact but discriminative
latent semantics. Here, it should be noted that we focus
on parameter-free graph construction for manifold learning in
this paper and thus only adopt the commonly used spectral
embedding method introduced in [21], without considering
other manifold learning techniques developed in the literature.

Although our latent semantic learning method can be regarded
as dimensionality reduction over mid-level features,
it differs substantially from the traditional dimensionality reduction
approaches [20], [22] based on spectral embedding.
Firstly, the latent semantics learnt by our method can help to
form a high-level representation and thus bridge the semantic
gap to some extent. This is also the reason why the topic
models [13], [14] for latent semantic analysis are widely used
for multimedia information processing. However, the traditional
dimensionality reduction approaches based on spectral
embedding fail to give an explicit explanation of each reduced
feature, since they directly use the eigenvectors of the Laplacian
matrix to form the new feature representation. Secondly, our
latent semantic learning method by spectral embedding with the
L1-graph over mid-level features incurs much less time cost
than the traditional dimensionality reduction approaches by
spectral embedding with graphs over all the data.

In the literature, many efforts have been made to explore
sparse coding [27], [28] for different difficult tasks. However,
this paper makes the first attempt to apply structured sparse
coding to latent semantic learning for action recognition. More
importantly, we have developed a novel structured sparse
coding algorithm for L1-graph construction with the structured
sparsity ensured by L1-norm hypergraph regularization,
different from the traditional L1-graph construction methods
[34], [35], which do not consider the structured sparsity. Here,
it should be noted that our new L1-norm hypergraph regularization
is defined directly over all the eigenvectors of the
hypergraph Laplacian matrix, unlike the p-Laplacian regularization
[36], which is an ordinary L1-generalization (with p = 1)
of the Laplacian regularization. Moreover, although Laplacian
regularization is also combined with sparse coding in visual
keyword generation [37], it is just a quadratic term and thus
is hard to use in parameter-free L1-graph construction for
learning latent semantics from visual keywords. In contrast,
our L1-norm hypergraph regularization can be readily used
for parameter-free L1-graph construction as an L1-norm term.
We will provide further comparison to [36], [37] in Section
III-B. In this paper, we focus on exploring the manifold structure
of mid-level features in sparse coding, regardless of other types
of structured sparsity [38], [39].

Since our main goal is to learn compact but discriminative
latent semantics from abundant mid-level features for human
action recognition, we consider a very simple experimental setting
in this paper. For example, only a single type of low-level
spatio-temporal descriptor is extracted from action videos,
just the same as [6]. Moreover, the learnt high-level features
are directly applied to action recognition without considering
their spatio-temporal layout information. In fact, we do not
use feature pruning [7], [18], [19], multiple types of low-level
features [7], [17], [18], or spatio-temporal layout information
[11], [17], [33] for action recognition. However, even with
such a simple setting, our method can still achieve improvements
with respect to the state of the art, as shown in our later
experiments.

III. Latent Semantic Learning with Structured 
Sparsity Representation 

In this section, we first propose a sparse coding algorithm
to construct the L1-graph for spectral embedding over mid-level
features. To explore structured sparsity in L1-graph construction,
we further improve the sparse coding algorithm with L1-norm
hypergraph regularization. Finally, in the new embedding
space, we learn latent semantics from abundant mid-level
features by spectral clustering.

A. Spectral Embedding with L1-Graph

Given a vocabulary of mid-level features $\mathcal{V}_m = \{m_i\}_{i=1}^{M}$,
each video can be represented as a histogram of mid-level
features $\{c_n(m_i) : i = 1, \dots, M\}$, where $c_n(m_i)$ is the
number of times that $m_i$ occurs in video $n$ ($n = 1, \dots, N$).
Based on this BOW representation, our goal is to discover the
manifold structure of mid-level features by spectral embedding
with graphs for learning compact but discriminative latent
semantics, which is different from the topic models [13],
[14] for latent semantic analysis. Although many spectral
embedding methods have been developed in previous work,
this paper focuses on graph construction as the key step of
spectral embedding. That is, once a graph is constructed, we
can adopt any spectral embedding method to discover the
manifold structure hidden among mid-level features. Since the
traditional graph construction method proposed in [19] has
difficulty in choosing the variance for the Gaussian function,
we construct a graph with sparse representation (i.e. the
L1-graph) in a parameter-free manner, inspired by recent
advances in sparse coding [27], [28]. Specifically, we first
represent each mid-level feature $m_i$ ($i = 1, \dots, M$) as a vector
$x_i = \{c_n(m_i) : n = 1, \dots, N\}$ and then find the solution of the
linear reconstruction of $m_i$ using the rest of the mid-level features
based on sparse coding. Since the obtained sparse coefficients
for linear reconstruction can be used to define the similarity
between mid-level features, we succeed in constructing an L1-graph
for spectral embedding.

Our L1-graph construction by linear reconstruction with
sparse coding is presented in detail as follows. For each
mid-level feature $m_i$ ($i = 1, \dots, M$), we suppose it can be
reconstructed using the rest of the mid-level features, which results
in an underdetermined linear system: $x_i = B_i \alpha_i$, where
$x_i \in R^{N}$ is the vector of $m_i$ to be approximated, $\alpha_i \in R^{M-1}$
is the vector of unknown reconstruction coefficients, and
$B_i = [x_1, x_2, \dots, x_{i-1}, x_{i+1}, \dots, x_M] \in R^{N \times (M-1)}$ is the
overcomplete dictionary with $M - 1$ bases. According to [27],
if the solution for $x_i$ is sparse enough, it can be recovered by:

$$\min_{\alpha_i} \|\alpha_i\|_1, \quad \text{s.t. } x_i = B_i \alpha_i, \tag{1}$$

where $\|\alpha_i\|_1$ is the L1-norm of $\alpha_i$. Considering that the structured
sparsity term (i.e. L1-norm hypergraph regularization) is
defined over all mid-level features, we need to use $x_i$ also as
a base and thus reformulate the above sparse representation
problem as follows:

$$\min_{\alpha_i} \|C_i \alpha_i\|_1, \quad \text{s.t. } x_i = B \alpha_i, \tag{2}$$

where $\alpha_i \in R^{M}$ is the new vector of unknown reconstruction
coefficients, $B = [x_1, \dots, x_M] \in R^{N \times M}$ is the overcomplete
dictionary with $M$ bases, and $C_i \in R^{M \times M}$ is a diagonal
matrix with its $(j,j)$-element $C_i(j,j) = +\infty$ ($j = i$) and
$C_i(j,j) = 1$ ($j \neq i$). Due to this special form of $C_i$, we
always have $\alpha_i(i) = 0$ for problem (2), where $\alpha_i(i)$ is the
$i$-th element of $\alpha_i$. This means that we can obtain a solution
equivalent to that of the original problem (1). Here, it should
be noted that the distinct advantage of the above reformulation
is that the L1-norm hypergraph regularization defined over
all mid-level features can now be readily explored in sparse
representation, which will be shown in the next subsection.
Moreover, if we set $\hat{\alpha}_i = C_i \alpha_i$, the sparse representation
problem (2) can be transformed into:

$$\min_{\hat{\alpha}_i} \|\hat{\alpha}_i\|_1, \quad \text{s.t. } x_i = B C_i^{-1} \hat{\alpha}_i, \tag{3}$$

which takes the same form as the original problem (1). In
practice, due to noise, we can reconstruct $x_i$ similarly to
[28]: $x_i = B C_i^{-1} \hat{\alpha}_i + \zeta_i$, where $\zeta_i$ is the noise term. The above
problem can then be redefined by minimizing the L1-norm of
both reconstruction coefficients and reconstruction error:

$$\min_{\alpha'_i} \|\alpha'_i\|_1, \quad \text{s.t. } x_i = B'_i \alpha'_i, \tag{4}$$

where $B'_i = [B C_i^{-1}, I] \in R^{N \times (N+M)}$ and $\alpha'_i = [\hat{\alpha}_i^T, \zeta_i^T]^T$.

Fig. 2. Illustration of the L1-graph constructed by sparse coding:
(a) L1-graph; (b) affinity matrix. The
vocabulary of mid-level features is $\mathcal{V}_m = \{m_i\}_{i=1}^{8}$, and the set of five
videos is represented as: video 1 = {m_2, m_5}, video 2 = {m_1, m_2, m_3},
video 3 = {m_4, m_5, m_3}, video 4 = {mg,mf,ms}, and video 5
= {m_2, m_5, m_6, m_8}. If we assume that each mid-level feature occurs only
once in a video, the affinity matrix given by Fig. 2(b) can be computed by
sparse coding, and the corresponding L1-graph is shown in Fig. 2(a), where
each graph edge only denotes whether a pair of mid-level features are related
and its length has no meaning.
This convex optimization can be transformed into a general
linear programming problem and thus has a globally optimal
solution. After we have obtained the reconstruction coefficients
for all the mid-level features, we can define an affinity matrix
$A = \{a_{ij}\}_{M \times M}$ as follows (considering the special form of
$C_i$):

$$a_{ij} = \begin{cases} |\alpha'_i(j)|, & j \neq i, \\ 0, & j = i, \end{cases} \tag{5}$$

where $\alpha'_i(j)$ is the $j$-th element of the vector $\alpha'_i$. By setting
$A = (A + A^T)/2$, we can construct an undirected graph $\mathcal{G} =
\{\mathcal{V}, A\}$ with the vertex set $\mathcal{V}$ being the vocabulary $\mathcal{V}_m$. In the
following, we call it the L1-graph, since it is obtained
by L1-optimization.
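As an illustration only, the construction in problems (2)-(5) can be sketched in Python. The sketch assumes that SciPy's `linprog` (HiGHS solver) is an acceptable stand-in for the unspecified linear programming solver, that the affinities are the absolute values of the recovered coefficients, and that the raw count vectors are used without any column normalization; the function name `l1_graph` and the positive/negative variable splitting are illustrative choices rather than details given in this paper.

```python
import numpy as np
from scipy.optimize import linprog


def l1_graph(X):
    """Build a symmetric L1-graph affinity matrix from an N x M count matrix X.

    Column i of X plays the role of x_i (occurrences of mid-level feature m_i
    in each of the N videos).
    """
    X = np.asarray(X, dtype=float)
    N, M = X.shape
    A = np.zeros((M, M))
    for i in range(M):
        B = X.copy()
        B[:, i] = 0.0  # multiplying by C_i^{-1} zeroes column i (1 / +inf = 0)
        B_prime = np.hstack([B, np.eye(N)])  # B'_i = [B C_i^{-1}, I]
        n = M + N
        # Split alpha' = u - v with u, v >= 0, so that ||alpha'||_1 = sum(u + v)
        # at the optimum and problem (4) becomes a standard-form LP.
        res = linprog(
            c=np.ones(2 * n),
            A_eq=np.hstack([B_prime, -B_prime]),
            b_eq=X[:, i],
            bounds=(0, None),
            method="highs",
        )
        alpha = res.x[:n] - res.x[n:]
        A[i] = np.abs(alpha[:M])  # assumed form of Eq. (5) for j != i
        A[i, i] = 0.0             # Eq. (5) for j = i
    return (A + A.T) / 2          # symmetrise to obtain an undirected graph


# Toy input: features 0 and 1 have identical occurrence patterns,
# feature 2 occurs in a different video only.
X_toy = np.array([[2, 2, 0],
                  [2, 2, 0],
                  [0, 0, 3]])
A_toy = l1_graph(X_toy)
```

In this toy input the first two features reconstruct each other exactly, so they end up connected, while the third feature can only be explained by the noise term and stays isolated. The number of neighbours per vertex is decided by the optimization itself, which is the parameter-free property discussed in the text.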

Due to the sparse representation given by equation (4),
each mid-level feature (i.e. vertex) in this L1-graph is only
related to several other mid-level features (see the example
shown in Fig. 2). Although the traditional k-nearest neighbors
(k-NN) graph also has such a sparse property, our L1-graph
constructed by linear reconstruction with sparse coding has a
distinct advantage, i.e., we can determine the number of related
mid-level features automatically for each mid-level feature and
thus do not need to set it to a fixed value as in the k-NN graph.
For example, we can observe from Fig. 2(a) that each mid-level
feature is related to a different number of mid-level
features. Moreover, another advantage of our L1-graph
is that the similarity between mid-level features is learnt by
sparse coding in a parameter-free manner, while the traditional
graph construction method proposed in [19] has difficulty in
choosing the variance for the Gaussian function that
is used to characterize the similarity measure.

Based on the above L1-graph, we further perform spectral
embedding to discover the manifold structure of mid-level
features. The goal of spectral embedding is to represent each
vertex in the L1-graph as a lower dimensional vector that
preserves the similarities between the vertex pairs. Actually,
this is equivalent to finding the leading eigenvectors (i.e. those
with the smallest eigenvalues) of the normalized Laplacian matrix

$$\mathcal{L} = I - D^{-1/2} A D^{-1/2}, \tag{6}$$

where $D$ is a diagonal matrix with its $(i,i)$-element equal
to the sum of the $i$-th row of the affinity matrix $A$. In this
paper, we only consider this type of normalized Laplacian [21],
regardless of other normalized versions lEUl. Let $\{(\lambda_i, v_i) :
i = 1, \dots, M\}$ be the set of eigenvalues and the associated
eigenvectors of $\mathcal{L}$, where $0 \leq \lambda_1 \leq \dots \leq \lambda_M$ and $v_i^T v_i = 1$.
The spectral embedding of the L1-graph is given by

$$E = [v_1, \dots, v_K], \tag{7}$$

where the $j$-th row $E_{j.}$ of the matrix $E$ can be regarded as
the new representation for vertex $m_j$. Here, it should be noted
that we focus on parameter-free L1-graph construction for
manifold learning in this paper and thus only adopt the spectral
embedding method introduced in [21], without considering
other manifold learning techniques that have been developed
in the literature. Since we usually set $K < M$, the mid-level
features have actually been represented as lower dimensional
vectors which can be further used for latent semantic learning
by spectral clustering.
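Equations (6) and (7) can be sketched with NumPy as follows. This is a minimal illustration, not a reference implementation: the function name `spectral_embedding` is hypothetical, vertices with zero degree are given a zero scaling factor (an assumption, since the text does not say how isolated vertices are treated), and the sketch simply keeps the first $K$ eigenvectors without deciding which of them count as trivial.

```python
import numpy as np


def spectral_embedding(A, K):
    """Return the K smallest eigenvalues of Eq. (6) and the embedding E of Eq. (7)."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # row sums = diagonal of D
    d_inv_sqrt = np.zeros_like(d)
    nonzero = d > 0
    d_inv_sqrt[nonzero] = 1.0 / np.sqrt(d[nonzero])  # assumption: isolated vertices get 0
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order, unit-norm vectors
    return eigvals[:K], eigvecs[:, :K]     # row j of the second output is E_{j.}


# Two disconnected pairs of vertices: the Laplacian has eigenvalue 0 twice.
A_blocks = np.array([[0, 1, 0, 0],
                     [1, 0, 0, 0],
                     [0, 0, 0, 1],
                     [0, 0, 1, 0]])
vals, E = spectral_embedding(A_blocks, K=2)
```

Because `numpy.linalg.eigh` returns eigenvalues in ascending order, slicing the first $K$ columns matches the ordering $0 \leq \lambda_1 \leq \dots \leq \lambda_M$ used in the text.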

B. L1-Graph Construction with Structured Sparsity Representation

In the above L1-graph, the similarity between mid-level
features is defined as the reconstruction coefficients of the
linear reconstruction solution obtained by sparse coding. However,
the structured sparsity of these reconstruction coefficients
is ignored in such sparse representation. In this paper, we
only consider one special type of structure, i.e., the manifold
structure of the mid-level features. Actually, this manifold
structure can be explored in sparse representation based on
the normalized Laplacian matrix of the hypergraph [30]-[32]
defined over mid-level features, which is well known
as Laplacian regularization or hypergraph regularization. The
distinct advantage of hypergraph regularization is that the
structured sparsity can be ensured for sparse representation
and thus we can obtain a new structured sparse representation
for L1-graph construction. Since the hypergraph plays an
important role in structured sparse representation, we will first
give the details of hypergraph construction.

In fact, the hypergraph can be constructed in a parameter-free
manner. That is, each video that contains multiple mid-level
features is regarded as a hyperedge, and its weight can be
estimated based on the original cluster centers associated with
mid-level features. Suppose each video is represented as a histogram
of mid-level features $\{c_j(m_i) : i = 1, \dots, M\}$, where
$c_j(m_i)$ is the number of times that mid-level feature $m_i$ occurs
in video $j$ ($j = 1, \dots, N$). The hypergraph $\mathcal{G} = \{\mathcal{V}, \mathcal{E}, H, w\}$
can be constructed as follows. We first set $\mathcal{V} = \mathcal{V}_m = \{m_i\}_{i=1}^{M}$
and $\mathcal{E} = \{e_j : e_j = \{m_i : c_j(m_i) > 0, i = 1, \dots, M\}\}_{j=1}^{N}$.
The incidence matrix $H$ of the hypergraph $\mathcal{G}$ can be directly
defined by

$$H_{ij} = c_j(m_i) \Big/ \sum_{i'=1}^{M} c_j(m_{i'}). \tag{8}$$



Fig. 3. Illustration of the hypergraph constructed in a parameter-free manner:
(a) hypergraph; (b) incidence matrix.
In Fig. 3(a), each dashed ellipse denotes a hyperedge (i.e. video), and each red
solid node denotes a vertex (i.e. mid-level feature). The incidence matrix
$H$ of the hypergraph given by Fig. 3(b) is computed using the occurrences
of mid-level features within videos.



Here, we consider a soft incidence matrix (i.e. $H_{ij} \in [0, 1]$),
which is different from [31] with $H_{ij} = 1$ or $0$. Moreover, we
define the hyperedge weights $w = \{w(e_j)\}_{j=1}^{N}$ by



w(ej) 



- y 



(9) 



where $|e_j|$ denotes the number of vertices within $e_j$, and $R$
is the linear kernel matrix defined with the original cluster
centers associated with mid-level features. This ensures that
the weight of $e_j$ is set to a larger value when this hyperedge is
more compact. Given these hyperedge weights, we can define
the degree of a vertex $m_i \in \mathcal{V}$ as $d(m_i) = \sum_{e_j \in \mathcal{E}} w(e_j) H_{ij}$.
For a hyperedge $e_j \in \mathcal{E}$, its degree is defined as $\delta(e_j) =
\sum_{m_i \in \mathcal{V}} H_{ij}$. An example hypergraph is shown in Fig. 3.

It is worth noting that the above hypergraph construction
method is parameter-free, which is similar to our L1-graph
construction method by linear reconstruction with sparse coding.
More importantly, according to [30], the above hypergraph
can capture the high-order correlation between mid-level
features. Moreover, to define the hypergraph regularization
term for sparse representation, we first compute the normalized
Laplacian matrix in the same way as [31]:

$$\mathcal{L}_h = I - D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}, \tag{10}$$

where $D_v$, $D_e$, and $W$ denote the diagonal matrices of the
vertex degrees, the hyperedge degrees, and the hyperedge
weights, respectively. Based on this normalized Laplacian
matrix $\mathcal{L}_h$, we can then define the hypergraph regularization
term for the sparse representation problem (2) as $\alpha_i^T \mathcal{L}_h \alpha_i$,
which can also be regarded as a smoothness measure of $\alpha_i$
over the hypergraph.
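The incidence matrix of Eq. (8), the vertex and hyperedge degrees, and the normalized hypergraph Laplacian of Eq. (10) can be sketched as follows. The hyperedge weights are passed in as given numbers, because the sketch does not attempt to reproduce Eq. (9); the function name `hypergraph_laplacian` is hypothetical, and the code assumes that every video contains at least one mid-level feature and that every mid-level feature occurs in at least one video with positive weight.

```python
import numpy as np


def hypergraph_laplacian(counts, w):
    """counts[i, j] = c_j(m_i) (M features x N videos); w[j] = weight of hyperedge e_j."""
    counts = np.asarray(counts, dtype=float)
    w = np.asarray(w, dtype=float)
    H = counts / counts.sum(axis=0, keepdims=True)  # Eq. (8): each column sums to 1
    d_v = H @ w                                     # d(m_i) = sum_j w(e_j) H_ij
    d_e = H.sum(axis=0)                             # delta(e_j) = sum_i H_ij
    dv_inv_sqrt = 1.0 / np.sqrt(d_v)
    # D_v^{-1/2} H W D_e^{-1} H^T D_v^{-1/2}
    theta = (dv_inv_sqrt[:, None] * H) @ np.diag(w / d_e) @ (H.T * dv_inv_sqrt[None, :])
    return H, np.eye(H.shape[0]) - theta            # Eq. (10)


# Three mid-level features, two videos (hyperedges) with illustrative weights.
counts_toy = np.array([[1, 0],
                       [1, 2],
                       [0, 2]])
H_toy, L_h_toy = hypergraph_laplacian(counts_toy, w=[1.0, 2.0])
```

With the soft incidence matrix of Eq. (8), every hyperedge degree $\delta(e_j)$ equals one, so $D_e$ is the identity here; it is kept in the code only to mirror Eq. (10).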

However, this hypergraph regularization term is hard to
incorporate directly into the sparse representation problem
(2), either as a part of the objective function or as a constraint.
Hence, we further formulate an L1-norm version of
hypergraph regularization as:

$$\|B_h \alpha_i\|_1, \tag{11}$$

where $B_h = \Sigma_h^{1/2} V_h^T$, $V_h$ is an $M \times M$ orthonormal matrix with
each column being an eigenvector of $\mathcal{L}_h$, and $\Sigma_h$ is an $M \times M$
diagonal matrix with its diagonal element $\Sigma_h(i,i)$ being an
eigenvalue of $\mathcal{L}_h$ (sorted as $\Sigma_h(1,1) \leq \dots \leq \Sigma_h(M,M)$).
Given that $\mathcal{L}_h$ is nonnegative definite, $\Sigma_h \geq 0$ (i.e. all the
eigenvalues $\geq 0$). Since $\mathcal{L}_h V_h = V_h \Sigma_h$ and $V_h$ is orthonormal,
we have $\mathcal{L}_h = V_h \Sigma_h V_h^T$. Hence, the original hypergraph
regularization can be reformulated as:

$$\alpha_i^T \mathcal{L}_h \alpha_i = \alpha_i^T V_h \Sigma_h^{1/2} \Sigma_h^{1/2} V_h^T \alpha_i = \alpha_i^T B_h^T B_h \alpha_i = \|B_h \alpha_i\|_2^2, \tag{12}$$

which means that our new formulation $\|B_h \alpha_i\|_1$ is indeed
an L1-norm version of the original hypergraph regularization.
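The identity in Eq. (12) is easy to check numerically. The sketch below builds $B_h = \Sigma_h^{1/2} V_h^T$ from an eigendecomposition and compares the quadratic form with the squared L2-norm; the function name `regularizer_basis`, the clipping of tiny negative eigenvalues caused by round-off, and the random nonnegative definite test matrix are assumptions made for illustration.

```python
import numpy as np


def regularizer_basis(L_h):
    """Return B_h = Sigma_h^{1/2} V_h^T for a symmetric nonnegative definite L_h."""
    eigvals, V_h = np.linalg.eigh(L_h)      # ascending eigenvalues, orthonormal columns
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return np.sqrt(eigvals)[:, None] * V_h.T


rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
L_demo = G @ G.T                            # any nonnegative definite matrix works here
B_demo = regularizer_basis(L_demo)
alpha = rng.standard_normal(5)

quadratic_form = alpha @ L_demo @ alpha           # alpha^T L_h alpha
squared_l2 = np.sum((B_demo @ alpha) ** 2)        # ||B_h alpha||_2^2
l1_version = np.sum(np.abs(B_demo @ alpha))       # ||B_h alpha||_1, the term in Eq. (11)
```

The first two quantities agree up to floating-point error, while the third is the L1 counterpart that can be folded into a linear program.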
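By introducing noise terms for linear reconstruction and
L1-norm hypergraph regularization, we transform the sparse
representation problem (2) into

$$\min_{\alpha_i, \zeta_i, \xi_i} \left\| [(C_i \alpha_i)^T, \zeta_i^T, \xi_i^T]^T \right\|_1, \quad \text{s.t. } x_i = B \alpha_i + \zeta_i, \;\; 0 = B_h \alpha_i + \xi_i, \tag{13}$$

where the reconstruction error and hypergraph smoothness
with respect to $\alpha_i$ are controlled by $\zeta_i$ and $\xi_i$, respectively.
If we set $\hat{\alpha}_i = C_i \alpha_i$, we can reformulate the above problem
as

$$\min_{\hat{\alpha}_i, \zeta_i, \xi_i} \left\| [\hat{\alpha}_i^T, \zeta_i^T, \xi_i^T]^T \right\|_1, \quad \text{s.t. } x_i = B C_i^{-1} \hat{\alpha}_i + \zeta_i, \;\; 0 = B_h C_i^{-1} \hat{\alpha}_i + \xi_i. \tag{14}$$

Let $\alpha'_i = [\hat{\alpha}_i^T, \zeta_i^T, \xi_i^T]^T$,
$B'_i = \begin{bmatrix} B C_i^{-1} & I & 0 \\ B_h C_i^{-1} & 0 & I \end{bmatrix}$,
and $x'_i = [x_i^T, 0^T]^T$. We finally solve the following structured
sparse representation problem for L1-graph construction:

$$\min_{\alpha'_i} \|\alpha'_i\|_1, \quad \text{s.t. } x'_i = B'_i \alpha'_i, \tag{15}$$

which takes the same form as the original sparse representation
problem (4). The affinity matrix $A$ of the L1-graph can be
defined in the same way as equation (5).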
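Problem (15) differs from problem (4) only in the stacked dictionary and target vector, so it can be sketched with the same linear programming reduction. As before, this is a hedged illustration: SciPy's `linprog` is assumed as the solver, absolute coefficient values are assumed as affinities, and the function name `structured_l1_graph` is hypothetical. The matrix `B_h` is taken as an input and is expected to be $\Sigma_h^{1/2} V_h^T$ computed from the hypergraph Laplacian.

```python
import numpy as np
from scipy.optimize import linprog


def structured_l1_graph(X, B_h):
    """X: N x M counts (column i is x_i); B_h: M x M matrix from the hypergraph Laplacian."""
    X = np.asarray(X, dtype=float)
    B_h = np.asarray(B_h, dtype=float)
    N, M = X.shape
    A = np.zeros((M, M))
    for i in range(M):
        c_inv = np.ones(M)
        c_inv[i] = 0.0                                   # diagonal of C_i^{-1}
        top = np.hstack([X * c_inv, np.eye(N), np.zeros((N, M))])
        bottom = np.hstack([B_h * c_inv, np.zeros((M, N)), np.eye(M)])
        B_prime = np.vstack([top, bottom])               # B'_i of problem (15)
        x_prime = np.concatenate([X[:, i], np.zeros(M)]) # x'_i = [x_i^T, 0^T]^T
        n = B_prime.shape[1]
        # alpha' = u - v with u, v >= 0 turns the L1 objective into a linear one.
        res = linprog(
            c=np.ones(2 * n),
            A_eq=np.hstack([B_prime, -B_prime]),
            b_eq=x_prime,
            bounds=(0, None),
            method="highs",
        )
        alpha = res.x[:n] - res.x[n:]
        A[i] = np.abs(alpha[:M])                         # assumed form of Eq. (5)
        A[i, i] = 0.0
    return (A + A.T) / 2


X_toy = np.array([[2.0, 2.0, 0.0],
                  [2.0, 2.0, 0.0],
                  [0.0, 0.0, 3.0]])
A_plain = structured_l1_graph(X_toy, np.zeros((3, 3)))   # no regularization
A_reg = structured_l1_graph(X_toy, np.eye(3))            # illustrative B_h only
```

With `B_h` set to zero the extra constraint forces $\xi_i = 0$ and the problem reduces to problem (4). No trade-off weight appears anywhere in the objective, which is why the regularized construction stays parameter-free.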

In the above formulation of structured spare representation, 
our Li-norm hypergraph regularization can be smoothly in- 
corporated into the original sparse representation problem (2). 
However, this is not true for the tradition hypergraph regu- 
larization or Laplacian regularization, which may introduce 
extra parameters into the L\ -optimization and thus has conflict 
with our original goal of parameter-free Li-graph construction. 
Moreover, our ii-norm hypergraph regularization can cause 
another type of sparsity (see the extra noise term which 
can not be ensured by the tradition Laplacian regularization. 
These are also the main differences between our structured 
spare coding for high-level latent semantic learning and the 
Laplacian sparse coding proposed in 11371 for mid-level feature 
generation. Here, it should be noted that the p-Laplacian 
regularization [36 1 can also be regarded as an ordinary L\- 
generalization of the Laplacian regularization with p = 1. 

By defining a matrix $C_p \in R^{\frac{M(M-1)}{2} \times M}$, the p-Laplacian
regularization can be formulated as $\|C_p\alpha_i\|_1$ [40], similar to our
$L_1$-norm hypergraph regularization. Hence, we can apply p-Laplacian
regularization similarly to structured sparse representation. However, it incurs
too large a time cost due to the large matrix $C_p$, even for a moderate
vocabulary size M = 500.
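
To give a feel for the size of such a matrix, the following minimal sketch builds an unweighted pairwise-difference matrix with one row per pair of mid-level features. This is an assumption made for illustration: the actual $C_p$ may carry edge weights, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_difference_matrix(m):
    """Unweighted pairwise-difference matrix: m*(m-1)/2 rows, m columns.

    Row (j, k) holds +1 in column j and -1 in column k, so that the product
    with a coefficient vector alpha yields alpha[j] - alpha[k].
    """
    rows = []
    for j, k in combinations(range(m), 2):
        row = [0.0] * m
        row[j] = 1.0
        row[k] = -1.0
        rows.append(row)
    return rows


def l1_pairwise_penalty(c, alpha):
    """L1 norm of c @ alpha, i.e. the sum of |alpha[j] - alpha[k]| over all pairs."""
    return sum(abs(sum(cij * aj for cij, aj in zip(row, alpha))) for row in c)


def num_rows(m):
    """Number of unordered pairs among m mid-level features."""
    return m * (m - 1) // 2


C4 = pairwise_difference_matrix(4)
print(len(C4), len(C4[0]))                           # rows and columns for m = 4
print(l1_pairwise_penalty(C4, [1.0, 1.0, 0.0, 0.0]))
print(num_rows(500), num_rows(500) * 500)            # rows and entries for M = 500
```

For M = 500 this gives 124,750 rows and over 62 million entries if the matrix is stored densely, which illustrates why the time cost grows quickly with the vocabulary size.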

C. Latent Semantic Learning by Spectral Clustering 

After the $L_1$-graph has been constructed with structured sparse
representation, we perform spectral embedding using the normalized Laplacian
matrix. In the new low-dimensional embedding space, we learn high-level latent
semantics by spectral clustering. The algorithm is summarized as follows:

(1) Find the K smallest nontrivial eigenvectors $v_1, ..., v_K$ of
the normalized Laplacian matrix $\mathcal{L}$ of the $L_1$-graph
constructed with structured sparse representation.

(2) Form $E = [v_1, ..., v_K]$, and normalize each row of
E to have unit length. Here, the i-th row $E_{i.}$ is a new
low-dimensional feature vector for mid-level feature $m_i$.

(3) Perform k-means clustering on the new feature vectors
$E_{i.}$ (i = 1, ..., M) to partition the vocabulary $V_m$
of M mid-level features into K clusters. Here, each
cluster of mid-level features denotes a new high-level
feature.
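
The three steps above can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the affinity matrix is symmetrized, the normalized Laplacian is taken as $I - D^{-1/2} A D^{-1/2}$, k-means uses a deterministic farthest-point initialisation, and the hypothetical `skip` argument controls how many of the smallest eigenvectors (e.g. the trivial one) are dropped. With the default `skip=0` the sketch simply keeps the K smallest eigenvectors.

```python
import numpy as np


def spectral_embedding(A, K, skip=0):
    """Embed graph nodes with eigenvectors of the normalized Laplacian.

    A    : (M, M) non-negative affinity matrix (symmetrized here by assumption)
    K    : number of eigenvectors to keep
    skip : number of smallest eigenvectors to drop before keeping K of them
    """
    A = 0.5 * (A + A.T)
    d = np.maximum(A.sum(axis=1), 1e-12)               # node degrees, guarded against zero
    s = 1.0 / np.sqrt(d)
    L = np.eye(len(A)) - s[:, None] * A * s[None, :]   # I - D^{-1/2} A D^{-1/2}
    _, vecs = np.linalg.eigh(L)                        # eigenvalues come back in ascending order
    E = vecs[:, skip:skip + K]
    norms = np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    return E / norms                                   # each row scaled to unit length


def kmeans(X, K, n_iter=50):
    """Plain Lloyd iterations with a deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, K):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d2))])
    centers = np.array(centers, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


# Toy affinity matrix: two tightly connected triples joined by one weak edge.
A = np.array([
    [0.00, 1, 1, 0.01, 0, 0],
    [1.00, 0, 1, 0.00, 0, 0],
    [1.00, 1, 0, 0.00, 0, 0],
    [0.01, 0, 0, 0.00, 1, 1],
    [0.00, 0, 0, 1.00, 0, 1],
    [0.00, 0, 0, 1.00, 1, 0],
], dtype=float)

labels = kmeans(spectral_embedding(A, K=2), K=2)
print(labels)   # mid-level features sharing a label form one high-level feature
```

On this toy graph the first three nodes fall into one cluster and the last three into the other, so each triple would become one high-level feature.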

In the following, our latent semantic learning algorithm based on spectral
embedding with structured sparse representation will be denoted as S$^2$LSL
(i.e. structured sparse latent semantic learning), while the algorithm based on
spectral embedding with sparse representation only will be denoted as SLSL
(i.e. sparse latent semantic learning). Since the spectral embedding is
performed with the $L_1$-graph over mid-level features, so that the graph size
depends on the vocabulary size M rather than on the number of videos, our
algorithm can run efficiently even on a large video dataset.

IV. Human Action Recognition with SVM

In this section, we present the details of human action recognition with SVM
using our learnt latent semantics. We first derive a new semantics-aware
representation (i.e. a histogram of high-level features) for each video from
the original BOW representation, and then define a histogram intersection
kernel based on the new representation for action recognition with SVM.

Let $V_h = \{h_i\}_{i=1}^{K}$ be the vocabulary of high-level features learnt
from the vocabulary of mid-level features $V_m = \{m_j\}_{j=1}^{M}$ by our
S$^2$LSL or SLSL algorithm. The BOW representation with $V_h$ for each video
can be derived from the original BOW representation with $V_m$ as follows.
Given the count of times $c_n(m_j)$ that mid-level feature $m_j$ occurs in
video n, the count of times $c_n(h_i)$ that high-level feature $h_i$ occurs in
this video can be computed by:

$$c_n(h_i) = \sum_{j=1}^{M} c_n(m_j)\, c(m_j, h_i), \qquad (16)$$



where $c(m_j, h_i) = 1$ if mid-level feature $m_j$ occurs in cluster i (i.e.
high-level feature $h_i$) according to the above spectral clustering, and
$c(m_j, h_i) = 0$ otherwise. That is, each video is now represented as a
histogram of high-level features. Similar to the traditional BOW
representation, the above semantics-aware representation can be used to define
a histogram intersection kernel $K_{HI}$:

$$K_{HI}(n, \tilde{n}) = \sum_{i=1}^{K} \min(c_n(h_i), c_{\tilde{n}}(h_i)). \qquad (17)$$



This semantics-aware kernel $K_{HI}$ is further used for human
action recognition with SVM.
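
Equations (16) and (17) can be illustrated with a few lines of plain Python. The sketch below is only an example under assumed inputs: `cluster_of` is a hypothetical list giving the cluster index of each mid-level feature, and the two toy videos are made up.

```python
def high_level_histogram(mid_counts, cluster_of, K):
    """Eq. (16): add up the counts of all mid-level features assigned to each cluster."""
    hist = [0] * K
    for j, count in enumerate(mid_counts):
        hist[cluster_of[j]] += count
    return hist


def histogram_intersection(hist_a, hist_b):
    """Eq. (17): sum of the element-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


# Hypothetical assignment of M = 5 mid-level features to K = 3 clusters.
cluster_of = [0, 0, 1, 2, 1]
video_a = [3, 1, 0, 2, 4]      # mid-level counts c_n(m_j) of a first toy video
video_b = [0, 2, 5, 1, 1]      # mid-level counts of a second toy video

h_a = high_level_histogram(video_a, cluster_of, 3)
h_b = high_level_histogram(video_b, cluster_of, 3)
print(h_a, h_b)                              # [4, 4, 2] [2, 6, 1]
print(histogram_intersection(h_a, h_b))      # 2 + 4 + 1 = 7
```

Evaluating `histogram_intersection` for every pair of videos yields a kernel matrix, which could then be handed to an SVM implementation that accepts precomputed kernels (for instance scikit-learn's `SVC(kernel="precomputed")`).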

Fig. 4. Retrieval examples using mid-level and high-level features on the YouTube action dataset [18]. For each query, four videos with the highest values of
the histogram intersection kernel are retrieved. The incorrectly retrieved videos (which do not come from the same action category as the query) are marked
with red boxes. The high-level features are shown to achieve significantly better retrieval results than the mid-level features.



To provide a preliminary evaluation of our learnt latent semantics, we apply
the above semantics-aware kernel to action retrieval, and some retrieval
examples on the YouTube action dataset [18] are shown in Fig. 4. Here, we learn
400 high-level features from 2,000 mid-level features by our S$^2$LSL
algorithm. We can find that the high-level features achieve significantly
better retrieval results than the mid-level features, which means that the
learnt high-level features provide a more succinct yet more discriminative
descriptor of human actions than the mid-level features. Moreover, we also find
in the experiments that similar dominant high-level features are used to
represent the videos from the same action category, although their exact
meanings are unknown. This is also the reason why we call them "latent
semantics" in this paper, as in the traditional topic models. In the following,
we will apply our semantics-aware representation to human action recognition on
the commonly used KTH action dataset [5] and the unconstrained YouTube action
dataset [18].

V. Experimental Results 

In this section, our latent semantic learning method is evaluated on two
standard action datasets. We first describe the experimental setup, including
information on the two action datasets and the implementation details. We then
compare our latent semantic learning method with other closely related methods
on each of the two datasets.

A. Experimental Setup 

We select two different action datasets for performance evaluation. The first
dataset is KTH [5], which contains six actions: boxing, clapping, waving,
jogging, running, and walking. These actions are performed by 25 actors under
four different scenarios. In total, this dataset contains 598 video clips.
Since KTH has been widely used for performance evaluation in human action
recognition, we can make direct comparison with the state-of-the-art methods
using their own reported results on this dataset. The second dataset is YouTube
[18], which has lots of camera movement, cluttered backgrounds, different
viewing directions, and varying illumination conditions. Hence, it is
significantly more complex and challenging than KTH. This action dataset
contains 11 categories: diving, golf swinging (g_swinging), horse riding
(h_riding), soccer juggling (s_juggling), swinging, tennis swinging
(t_swinging), trampoline jumping (t_jumping), volleyball spiking (v_spiking),
basketball shooting (b_shooting), biking, and walking (with a dog). Most of
them share some common motions such as "jumping" and "swinging". The video
clips are organized into 25 relatively independent groups, where separate
groups are taken either in different environments or by different
photographers. The dataset contains 1,168 video clips in total. To the best of
our knowledge, this is one of the most extensive realistic action datasets in
the literature.

To extract low-level features from the two action datasets, we adopt the
spatio-temporal interest point detector proposed in [6]. Compared to the 3D
Harris-Corner detector [8], it generates dense features which can improve the
recognition performance in most cases. Specifically, this detector makes use of
a 2D Gaussian filter and 1D Gabor filters in the spatial and temporal
directions, respectively. A response value is given at every position
(x, y, t). The interest points are selected at the locations of local maximal
responses, and 3D cuboids are extracted around them. For simplicity, we
describe the 3D cuboids using the flat gradient vectors, which are further
reduced to 100 dimensions by PCA, the same as [18], [19]. In our experiments,
we extract 400 descriptors from each video clip for the KTH dataset, while for
the YouTube dataset more descriptors (i.e. 1,600) are extracted from each video
clip since this dataset is more complex and challenging.
Finally, on the two action datasets, we quantize the extracted spatio-temporal
descriptors into M mid-level features by k-means clustering. Here, it should be
noted that we adopt a very simple experimental setting for low-level feature
extraction, since we focus on learning compact but discriminative latent
semantics in this paper. We do not consider pruning low-level features [7],
[18], [19] or combining multiple types of low-level features [7], [17], [18]
for human action recognition. However, even with such a simple setting, our
latent semantic learning method can still achieve performance improvements over
the state of the art, as shown in our later experiments.
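
The quantization step can be sketched as a nearest-centroid assignment followed by counting. The snippet below assumes that the k-means centroids are already available; the toy two-dimensional centroids and descriptors are made up for illustration (the descriptors used here have 100 dimensions after PCA).

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid with the smallest squared Euclidean distance."""
    best, best_d2 = 0, float("inf")
    for idx, c in enumerate(centroids):
        d2 = sum((x - y) ** 2 for x, y in zip(descriptor, c))
        if d2 < best_d2:
            best, best_d2 = idx, d2
    return best


def bow_histogram(descriptors, centroids):
    """Count how many descriptors of one video fall on each mid-level feature."""
    hist = [0] * len(centroids)
    for d in descriptors:
        hist[nearest_centroid(d, centroids)] += 1
    return hist


centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]                     # toy vocabulary, M = 3
descriptors = [[0.1, -0.2], [0.9, 1.2], [3.8, 0.1], [1.1, 0.8]]      # one toy video
print(bow_histogram(descriptors, centroids))                         # [1, 2, 1]
```

The resulting histogram plays the role of the counts $c_n(m_j)$ from which the high-level representation is derived.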

Since the diffusion map (DM) method for latent semantic learning proposed in
[19] has been reported to outperform other manifold learning techniques (e.g.
Isomap [41] and Eigenmaps [42]) and also the information theoretic approaches
(e.g. information maximization [17]), we focus on comparing our S$^2$LSL with
DM and do not make direct comparison with [17], [41], [42]. In fact, our
S$^2$LSL is shown in the later experiments to perform much better than DM,
which indirectly verifies the superiority of our method over the other manifold
learning techniques and the information theoretic approaches. Moreover, to show
the effectiveness of our structured sparse representation, we also compare our
S$^2$LSL with SLSL, which does not consider the structured sparsity. Finally,
our S$^2$LSL is compared with LDA and BOW, since they are the most widely used
in the literature. Here, all the methods for comparison except BOW are designed
to learn latent semantics from a large vocabulary of mid-level features. In the
following, we select M = 2,000 for the four latent semantic learning methods
(i.e. S$^2$LSL, SLSL, DM, and LDA). For the two action datasets, we use 24
actors or groups for training the SVM and the remaining one for testing, the
same as previous work [18], [19].
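
The train/test protocol can be sketched as a leave-one-group-out split. The helper below is hypothetical and rests on one assumption not stated above: that every actor or group is held out once in turn and the results are then averaged over the folds.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for every distinct group.

    groups[i] is the actor (KTH) or group (YouTube) identifier of video i.
    """
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# Toy example with six videos from three groups.
groups = [1, 1, 2, 3, 3, 3]
splits = list(leave_one_group_out(groups))
for held_out, train, test in splits:
    print(held_out, train, test)
```

With 25 actors or groups, each fold trains on the videos of 24 of them and tests on the videos of the remaining one.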

B. Results on the KTH Dataset 

The comparison of the four latent semantic learning methods is shown in Fig. 5.
We find that our S$^2$LSL generally performs the best. Compared to SLSL, which
does not consider the structured sparsity, our S$^2$LSL leads to better results
in most cases, which means that the structured sparsity ensured by our new
$L_1$-norm hypergraph regularization is very useful for learning compact but
discriminative latent semantics. In fact, the better discriminative ability of
the high-level features learnt by our S$^2$LSL may be due to the fact that the
manifold structure of mid-level features is explored by $L_1$-norm hypergraph
regularization in $L_1$-graph construction and then in latent semantic
learning. Moreover, we also find that our S$^2$LSL always achieves a
performance improvement over the DM method for latent semantic learning [19],
which becomes more significant when the number of high-level features is
relatively small (e.g. K < 150). The reason may be that our S$^2$LSL eliminates
the need to tune any parameter for graph construction by relying on structured
sparse representation, while DM suffers heavily from the difficulty of
parameter tuning in graph construction since the Gaussian function is used to
characterize the similarity between mid-level features. Here, it should be
noted that such parameter tuning can significantly affect the performance and
has been noted as an inherent weakness of graph-based methods.
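
The sensitivity to the kernel width can be seen in a tiny numerical sketch. The exact similarity function used by DM [19] is not reproduced here; the form $\exp(-d^2 / (2\sigma^2))$ and the values of `sigma` below are assumptions chosen only to illustrate the effect.

```python
import math


def gaussian_affinity(x, y, sigma):
    """Gaussian similarity exp(-||x - y||^2 / (2 * sigma^2)) between two vectors."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma ** 2))


x, y = [0.0, 0.0], [3.0, 4.0]          # two toy features at Euclidean distance 5
for sigma in (1.0, 5.0, 25.0):
    print(sigma, gaussian_affinity(x, y, sigma))
```

The same pair of features is treated as almost unrelated for a small width and as almost identical for a large one, so the resulting graph depends strongly on how the width is tuned.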

Similar to topic models such as LDA, our S$^2$LSL can explicitly learn latent
semantics from abundant mid-level features. However, since the manifold
structure of mid-level features can be explored in both $L_1$-graph
construction and spectral embedding, our S$^2$LSL is able to generate more
compact but discriminative high-level features for human action recognition,
which can be observed from Fig. 5. Specifically, our S$^2$LSL is shown to
outperform LDA significantly in all cases, which may be because our learnt
high-level features have better discriminative ability. Moreover, we can also
find that our S$^2$LSL consistently achieves promising results with a varied
number of high-level features, while LDA suffers from obvious performance
degradation for a large number of high-level features, since the number of
model parameters of LDA increases as K grows and thus only local optima may be
found in this case.

Fig. 5. Comparison of the four latent semantic learning methods for human
action recognition on the KTH dataset.

TABLE I
The relative performance of our S$^2$LSL as compared to BOW (M = 2,000) on the KTH dataset

K/M (%)               2.5    5.0    7.5   10.0   12.5   15.0   17.5   20.0   22.5
Speed Gain (%)        440    337    282    243    215    191    172    155    143
Accuracy Gain (%)    -1.0    0.6    0.9    1.4    2.0    1.7    1.2    0.9    1.1

In the above experiments, we learn latent semantics from M = 2,000 mid-level
features. To demonstrate the gain achieved by our S$^2$LSL, we need to make a
direct comparison to BOW with M = 2,000. Table I shows the relative performance
of our S$^2$LSL, where both speed and accuracy gains are computed relative to
BOW. In particular, to obtain the speed gain, we compare the speed of kernel
computation on the high-level features learnt by our S$^2$LSL to that of kernel
computation on the 2,000 mid-level features. Here, we only consider kernel
computation since the SVM classification incurs the same time cost once the
kernel matrix is provided. From Table I, we can observe that our S$^2$LSL can
reduce the number of features to a very low level (e.g. K/M = 5.0%) without
obvious performance degradation (or even with performance improvement), which
is exactly consistent with the original goal of latent semantic learning in the
literature. This property of our S$^2$LSL can speed up the subsequent
classification and retrieval significantly, which is extremely important for
large datasets.
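
As a small worked example, the K/M ratios in Table I translate into the following numbers of high-level features for M = 2,000. The operation-count ratio computed at the end is only an idealised proxy (one `min` per histogram bin) and is an assumption of this sketch; it is not the measured speed gain reported in the table.

```python
M = 2000                                                       # mid-level vocabulary size
ratios = [2.5, 5.0, 7.5, 10.0, 12.5, 15.0, 17.5, 20.0, 22.5]   # K/M (%) values of Table I

# Number of high-level features K that corresponds to each ratio.
K_values = [round(M * r / 100) for r in ratios]

# Element-wise min operations per kernel evaluation, relative to BOW with M bins.
op_ratio = [M / K for K in K_values]

print(K_values)        # [50, 100, 150, 200, 250, 300, 350, 400, 450]
print(op_ratio[1])     # 20.0 for K/M = 5.0%
```

For example, K/M = 5.0% corresponds to K = 100 high-level features, so each histogram intersection touches 20 times fewer bins than with the original 2,000 mid-level features.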

Our S$^2$LSL is further compared to BOW with M = 2,000 on each action category.
Here, we only consider K = 250 for our S$^2$LSL. To make an extensive
comparison, we also take BOW with M = 250 as a baseline method. The comparison
between our S$^2$LSL and these two BOW methods is shown in Fig. 6. We can find
that our S$^2$LSL leads to improvements over BOW (M = 2,000) on four action
categories: "boxing", "waving", "jogging", and "running", without performance
degradation on the other categories, even when the number of features is
decreased from 2,000 to 250. The ability of our S$^2$LSL to achieve promising
results using only a small number of features is important because it means
that the proposed method is scalable to large datasets. Moreover, our S$^2$LSL
is shown to perform better than BOW (M = 250) on all the action categories when
they select the same number of features.

Fig. 6. Comparison between S$^2$LSL and BOW for human action recognition
on the KTH dataset. Here, our S$^2$LSL learns 250 high-level features (i.e.
K = 250) from 2,000 mid-level features (i.e. M = 2,000).

TABLE II
Comparison of our S$^2$LSL with previous methods for action recognition on the KTH dataset
(feature pruning: FP; multiple features: MF; spatio-temporal layout: STL)

Methods                       FP     MF     STL    Accuracy (%)
Dollar et al. [6]             no     no     no     81.2
Niebles et al. [13]           no     no     no     83.3
Liu and Shah [17]             no     no     yes    94.2
Bregonzio et al. [7]          yes    yes    no     93.2
Liu et al. [18]               yes    yes    no     93.8
Liu et al. [19]               yes    no     no     92.3
Oikonomopoulos et al. [33]    no     no     yes    88.0
Wu et al. [11]                no     yes    yes    94.5
Our method                    no     no     no     95.1

Since we focus on learning compact but discriminative latent semantics for
human action recognition, we consider a very simple experimental setting in
this paper. For example, in the experiments, only a single type of low-level
spatio-temporal descriptor is extracted from action videos, the same as [6].
Moreover, the learnt high-level features are directly applied to human action
recognition without considering their spatio-temporal layout information. That
is, we do not use feature pruning [7], [18], [19], multiple types of low-level
features [7], [17], [18], or spatio-temporal layout information [11], [17],
[33] for action recognition. However, even with such a simple experimental
setting, our S$^2$LSL method can still achieve performance improvements over
the state of the art, as shown in Table II. This provides further validation of
the effectiveness of our latent semantic learning method based on structured
sparse representation by $L_1$-norm hypergraph regularization.




Fig. 7. Comparison of the four latent semantic learning methods for human
action recognition on the YouTube dataset.

TABLE III
The relative performance of our S$^2$LSL as compared to BOW (M = 2,000) on the YouTube dataset

K/M (%)               2.5    5.0    7.5   10.0   12.5   15.0   17.5   20.0   22.5
Speed Gain (%)        973    701    544    441    365    308    268    235    207
Accuracy Gain (%)   -14.1   -6.6   -3.0   -1.1   -1.5    1.6    1.7    1.9   -0.5



C. Results on the YouTube Dataset 

The YouTube dataset is more complex and challenging than KTH, since it has lots
of camera movement, cluttered backgrounds, different viewing directions, and
varying illumination conditions. We repeat the same experiments on this
dataset, and the recognition results are shown in Fig. 7, Table III, and
Fig. 8. Here, the four latent semantic learning methods are compared in Fig. 7,
while in Table III and Fig. 8 we focus on comparing our S$^2$LSL directly with
BOW to show the relative gain achieved by our S$^2$LSL. The speed gain in
Table III is again computed with only kernel computation taken into account.
Overall, we can make the same observations on this dataset as on the KTH
dataset.

Specifically, our S$^2$LSL generally achieves better performance than the other
latent semantic learning approaches. This observation further verifies that our
S$^2$LSL can learn more compact but discriminative latent semantics by
exploring the manifold structure of mid-level features in both graph
construction and spectral embedding. Moreover, the compact set of high-level
features learnt by our S$^2$LSL can speed up the subsequent kernel computation
significantly without obvious performance degradation (when K/M > 5.0%), as
shown in Table III. In particular, when our S$^2$LSL (K = 400) is compared to
BOW (M = 400 or 2,000) on each action category, we can observe from Fig. 8 that
our S$^2$LSL leads to performance improvements over BOW on most action
categories. The reason may be that the high-level features learnt by our
S$^2$LSL can help to reduce the semantic ambiguity of the most confusing action
categories. When we focus on the comparison between S$^2$LSL (K = 400) and BOW
(M = 2,000), the performance improvement achieved by our S$^2$LSL is notable
given that we have decreased the number of features from 2,000 to 400. Although
the commonly used LDA can do the same thing as our S$^2$LSL, it completely
fails in this case, as shown in Fig. 7. Considering the superior performance of
LDA reported in the literature and also the promising results reported in this
paper, our S$^2$LSL can be regarded as the best among different representative
latent semantic learning approaches.

Fig. 8. Comparison between S$^2$LSL and BOW for human action recognition
on the YouTube dataset. Here, our S$^2$LSL learns 400 high-level features (i.e.
K = 400) from 2,000 mid-level features (i.e. M = 2,000).

VI. Conclusions 

We have investigated the challenging problem of latent semantic learning in the
application of human action recognition. To bridge the semantic gap associated
with human action recognition, we have proposed a novel latent semantic
learning method based on structured sparse representation. To exploit the
manifold structure of mid-level features for latent semantic learning, we have
developed a spectral embedding approach based on the $L_1$-graph, which is
constructed with structured sparse representation in a parameter-free manner.
Although many efforts have been made to explore both sparse representation and
hypergraphs in different applications in the literature, we have made the first
attempt to integrate them in a unified structured sparse representation
framework for latent semantic learning. The experimental results have
demonstrated the superior performance of our latent semantic learning method.
In future work, considering the wide use of Laplacian regularization, our new
$L_1$-norm hypergraph regularization will be explored in many other machine
learning problems to take the structured sparsity into account. Moreover, our
latent semantic learning method will be extended to other difficult tasks such
as image annotation and scene classification.